[SDBM-2746] Collect standard system-table metrics per node in single endpoint mode #24266

Open
sangeetashivaji wants to merge 9 commits into master from sangeeta/clickhouse-single-endpoint-metrics

@sangeetashivaji sangeetashivaji commented Jun 30, 2026

What does this PR do?

Makes the standard (non-DBM) ClickHouse system-table metric collection cluster-aware when single_endpoint_mode is enabled. The bulk-match metric queries (system.events, system.metrics, system.asynchronous_metrics) and system.errors are now routed through clusterAllReplicas('default', system.<table>) and tagged with hostName() AS clickhouse_node, so each cluster node gets its own metric series.

system.parts, system.replicas, and system.dictionaries use GROUP BY and are intentionally left unchanged here (follow-up).
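As a rough illustration of the query shape described above, the sketch below builds the direct and the per-node form of a system-table query. `build_system_query` and its parameters are hypothetical names for this example, not the integration's API; only `clusterAllReplicas('default', ...)`, `hostName()`, and the `clickhouse_node` alias are taken from the description.

```python
def build_system_query(table: str, columns: str, single_endpoint_mode: bool) -> str:
    """Return a SELECT over system.<table>, fanned out per node when requested."""
    if not single_endpoint_mode:
        # Direct connection: read the local node's table, as before this PR.
        return f"SELECT {columns} FROM system.{table}"
    # Single endpoint mode: read the table on every replica and tag each row
    # with the host that produced it.
    return (
        f"SELECT {columns}, hostName() AS clickhouse_node "
        f"FROM clusterAllReplicas('default', system.{table})"
    )


direct = build_system_query("events", "value, event", single_endpoint_mode=False)
per_node = build_system_query("events", "value, event", single_endpoint_mode=True)
print(direct)
print(per_node)
```

With the extra column in place, each returned row carries the node it came from, which is what lets the check emit one series per node instead of one mixed series.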

Motivation

On a multi-node ClickHouse Cloud cluster reached through a single load-balancer endpoint, the standard metric queries read per-node system tables directly.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

In single endpoint mode the agent reaches a multi-node ClickHouse cluster
through one load balancer, but the standard metric queries read per-node
system tables directly (bare system.events/metrics/errors). Consecutive
scrapes land on different nodes, so cumulative counters appear to jump up and
down and the backend reports phantom monotonic_count spikes such as the false
clickhouse.query.failed.count in SDBM-2746.

Route the bulk-match metric queries and system.errors through
clusterAllReplicas() and tag each row with hostName() when single_endpoint_mode
is enabled, giving each node its own metric series (counters increment
monotonically per node; gauges stop flapping). Direct connections are
unchanged. system.parts/replicas/dictionaries use GROUP BY and are left for a
follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
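The failure mode described in this commit message can be reproduced with a toy simulation. The node names and counter values below are invented for illustration, and `deltas` is only a stand-in for how a cumulative counter is turned into increments; it is not the Agent's actual monotonic_count implementation.

```python
# Hypothetical cumulative counters on two nodes at four scrape times.
# Each node's own counter never decreases.
node_counters = {
    "node-a": [100, 101, 102, 103],
    "node-b": [10, 10, 11, 11],
}


def deltas(samples):
    """Differences between consecutive samples of one series."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Untagged series: the load balancer alternates nodes between scrapes, so a
# single series interleaves two unrelated counters.
mixed = [node_counters["node-a" if i % 2 == 0 else "node-b"][i] for i in range(4)]
mixed_deltas = deltas(mixed)

# Tagged series: one series per node, each increasing on its own.
per_node_deltas = {node: deltas(values) for node, values in node_counters.items()}

print(mixed, mixed_deltas)
print(per_node_deltas)
```

In the mixed series the value swings between the two nodes' counters, so an apparent increase of 92 shows up even though no node ever advanced by more than 1 between scrapes. Splitting the series by node removes that artifact.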
datadog-datadog-prod-us1-2 Bot commented Jun 30, 2026

Tests  Code Coverage

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 94.00% (+5.99%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 3054c6f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sangeetashivaji sangeetashivaji added the qa/required label (QA is required for this PR and will generate a QA card) Jun 30, 2026
sangeetashivaji and others added 2 commits June 30, 2026 16:20
Replace the manual search + start()/end() slicing with a single
SYSTEM_TABLE_FROM_CLAUSE.subn() pass (count=1). The regex now consumes
the whitespace before FROM so the per-node projection splices in without
a gap before the comma, keeping the rewritten query byte-for-byte
identical. Output and behavior are unchanged; all unit tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
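For readers unfamiliar with the single-pass `subn()` approach this commit describes, here is a minimal sketch. The names `SYSTEM_TABLE_FROM_CLAUSE` and `make_cluster_aware` appear in this PR's commit messages, but the pattern and function body below are assumptions written for illustration, not the code from the diff (and a later commit in this PR removes the regex approach entirely).

```python
import re

# Assumed pattern: the whitespace before FROM is part of the match, so the
# replacement can attach ", hostName() ..." directly to the last column.
SYSTEM_TABLE_FROM_CLAUSE = re.compile(r"\s+FROM\s+system\.(\w+)", re.IGNORECASE)

# \1 re-inserts the captured table name inside clusterAllReplicas().
REPLACEMENT = (
    r", hostName() AS clickhouse_node "
    r"FROM clusterAllReplicas('default', system.\1)"
)


def make_cluster_aware(query: str) -> str:
    """Rewrite the first 'FROM system.<table>' clause in a single pass."""
    rewritten, replaced = SYSTEM_TABLE_FROM_CLAUSE.subn(REPLACEMENT, query, count=1)
    if replaced != 1:
        # Fail loudly rather than silently returning an unmodified query.
        raise ValueError(f"no system table FROM clause found in: {query!r}")
    return rewritten


print(make_cluster_aware("SELECT value, name\nFROM system.errors WHERE value > 0"))
```

`subn()` returns both the new string and the number of substitutions, which replaces the separate search and slicing steps and makes the "nothing matched" case easy to detect.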
ruff 0.11.10 reformats the parenthesized assert condition back to the
trailing-message form. Apply it so `ddev test --fmt` / CI lint passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sangeetashivaji sangeetashivaji marked this pull request as ready for review June 30, 2026 20:41
@sangeetashivaji sangeetashivaji requested review from a team as code owners June 30, 2026 20:41
@sangeetashivaji sangeetashivaji changed the title from "Collect standard system-table metrics per node in single endpoint mode" to "[SDBM-2746] Collect standard system-table metrics per node in single endpoint mode" Jun 30, 2026
sangeetashivaji and others added 5 commits July 1, 2026 11:12
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Instead of rewriting system-table metric queries with a regex at runtime,
define static cluster-aware query variants and select them in get_queries()
based on single_endpoint_mode.

- Add cluster_aware_query() helper in utils.py that reuses the base query's
  columns by reference (no duplication) and appends the clickhouse_node tag.
- Define legacy variants in queries.py and advanced variants in
  advanced_queries (JSON-backed via load_match_query + SystemErrorsClusterAware).
- Remove import re, SYSTEM_TABLE_FROM_CLAUSE, and make_cluster_aware().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cluster_aware_query now takes only the base query dict and derives the
SELECT list and table from base['query'], so call sites no longer restate
column lists or table names. load_match_query reverts to its original
signature; __getattr__ wraps the loaded query for ClusterAware names.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
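A helper with the behavior this commit describes might look roughly like the sketch below. The name `cluster_aware_query` comes from the commit message; the body, the query dict shape (`name`, `query`, `columns`), and the restriction to plain `SELECT ... FROM system.<table>` queries are assumptions made for this example, not the code in utils.py.

```python
NODE_PROJECTION = "hostName() AS clickhouse_node"
NODE_TAG_COLUMN = {"name": "clickhouse_node", "type": "tag"}


def cluster_aware_query(base: dict) -> dict:
    """Derive a per-node variant from base['query'] without restating columns."""
    select_list, sep, table = base["query"].partition(" FROM ")
    table = table.strip()
    # Assumption: only simple queries are handled (one system table and no
    # WHERE or GROUP BY clause after it).
    if not sep or len(table.split()) != 1 or not table.startswith("system."):
        raise ValueError(f"unsupported base query: {base['query']!r}")
    return {
        "name": base["name"],
        "query": (
            f"{select_list}, {NODE_PROJECTION} "
            f"FROM clusterAllReplicas('default', {table})"
        ),
        # Column definitions are shared with the base query, not copied; only
        # the node tag column is new.
        "columns": [*base["columns"], NODE_TAG_COLUMN],
    }


base = {
    "name": "system.events",
    "query": "SELECT value, event FROM system.events",
    "columns": [{"name": "value", "type": "source"}, {"name": "event", "type": "tag"}],
}
wrapped = cluster_aware_query(base)
print(wrapped["query"])
```

Because the wrapper builds a new dict and a new column list, the base query is left untouched and can still be used for direct connections.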
Drop the static *ClusterAware definitions and the __getattr__ suffix
handling; get_queries wraps each base query via cluster_aware_query() when
single_endpoint_mode is on. queries.py and advanced_queries revert to base;
the change now touches only utils.py, clickhouse.py, and the tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
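The selection step in this final commit can be pictured as follows. This is a sketch only: the real `get_queries` signature is not shown on this page, the `wrap` parameter stands in for `cluster_aware_query()`, and identifying the GROUP BY queries by name through `GROUPED_QUERIES` is an assumption made for the example.

```python
from typing import Callable, Iterable, List

# Assumption: the GROUP BY queries that this PR leaves unchanged are
# recognized by name here; the real check may identify them differently.
GROUPED_QUERIES = {"system.parts", "system.replicas", "system.dictionaries"}


def get_queries(
    base_queries: Iterable[dict],
    single_endpoint_mode: bool,
    wrap: Callable[[dict], dict],
) -> List[dict]:
    """Pick base queries for direct connections, wrapped ones otherwise."""
    if not single_endpoint_mode:
        # Direct connections are unchanged.
        return list(base_queries)
    return [q if q["name"] in GROUPED_QUERIES else wrap(q) for q in base_queries]


def mark_per_node(query: dict) -> dict:
    """Stand-in wrapper that only flags the query, to keep the example small."""
    return {**query, "per_node": True}


queries = [{"name": "system.events"}, {"name": "system.parts"}]
print(get_queries(queries, single_endpoint_mode=False, wrap=mark_per_node))
print(get_queries(queries, single_endpoint_mode=True, wrap=mark_per_node))
```

Wrapping at selection time keeps a single definition per query, so there are no parallel ClusterAware copies to keep in sync with the base queries.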
dd-octo-sts Bot commented Jul 1, 2026

Validation Report

All 21 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and code coverage settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| qa-label | Validate the pull request declares whether it needs QA for the next Agent release | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |
